This analysis classifies samples using high-dimensional data. It includes two main steps:
The first step uses t-SNE (t-Distributed Stochastic Neighbor Embedding) to create a 2D or 3D map from high-dimensional data. The Barnes-Hut version of t-SNE is used to reduce computing time, so that it can be run multiple times. The t-SNE run with the lowest Kullback–Leibler (KL) divergence is used for the next step.
The second step uses the DBSCAN method to cluster samples based on the t-SNE result from the first step. The two key parameters of DBSCAN, the epsilon neighborhood and the minimum cluster size, are adjusted based on the silhouette score to obtain the optimal classification.
Additionally, the result of unsupervised clustering is compared to any known sample features to evaluate their association.
Comparison between cell lines from 9 different cancer tissues (NCI-60); GSE5949
Reinhold WC, Reimers MA, Lorenzi P, Ho J et al. Multifactorial regulation of E-cadherin expression: an integrative study. Mol Cancer Ther 2010 Jan;9(1):1-16. PMID: 20053763.
Comparison between cell lines from 9 different cancer tissue of origin types (Breast, Central Nervous System, Colon, Leukemia, Melanoma, Non-Small Cell Lung, Ovarian, Prostate, Renal) from NCI-60 panel
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction method that maps high-dimensional data to a 2- or 3-dimensional space. This analysis uses the Barnes-Hut version on a data matrix with 60 samples and 17647 features. The method is implemented by the Rtsne {Rtsne} R function.
The quality of a t-SNE map is measured by its overall KL (Kullback–Leibler) divergence, which quantifies how well the pairwise similarities between samples in the map match those in the original data. A better t-SNE run has a lower KL divergence. Each t-SNE run of this analysis involves the following sub-steps:
The steps above are repeated 100 times, and the output of the run with the lowest KL divergence is taken as the final t-SNE result.
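The repeat-and-select procedure can be sketched as follows. This is a minimal illustration, not the code used to produce this report: the helper name `best_tsne`, the perplexity value, and the random stand-in matrix are assumptions. The total KL divergence of a run is taken here as the sum of the per-sample `costs` returned by `Rtsne`.

```r
# Run Barnes-Hut t-SNE several times and keep the run with the lowest KL divergence.
# `x` must have samples in rows and features in columns.
best_tsne <- function(x, n_runs = 100, dims = 2, perplexity = 15, theta = 0.5, ...) {
  best <- NULL
  for (i in seq_len(n_runs)) {
    fit <- Rtsne::Rtsne(x, dims = dims, perplexity = perplexity, theta = theta,
                        check_duplicates = FALSE, ...)
    kl <- sum(fit$costs)  # per-sample costs after the final iteration
    if (is.null(best) || kl < best$kl) {
      best <- list(Y = fit$Y, kl = kl, run = i)
    }
  }
  best
}

# Demo on random data (hypothetical stand-in for a 60-sample matrix);
# only runs when the Rtsne package is installed.
if (requireNamespace("Rtsne", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(rnorm(60 * 200), nrow = 60)
  best <- best_tsne(x, n_runs = 5)
  cat("best run:", best$run, " KL divergence:", best$kl, "\n")
}
```

With 60 samples, `Rtsne` requires a perplexity below about 19 (three times the perplexity must be smaller than the number of samples minus one), which is why a small value is assumed here.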
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering method based on the density of samples in a multi-dimensional space. This analysis uses the dbscan {dbscan} R implementation of this method. Its input is the result of the t-SNE run with the lowest KL divergence, and its output is evaluated by the silhouette score of the clustered samples. Each DBSCAN run requires the following parameters:
The optimal value of minPts (minimal number of samples in a cluster) is 2. The corresponding clustering result of DBSCAN will be used for the rest of this analysis.
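One way to tune the two DBSCAN parameters against the silhouette score is a simple grid scan, sketched below. The helper name `scan_dbscan`, the grid values, the toy embedding, and the choice to exclude noise points (cluster 0) before scoring are all assumptions made for illustration; the report's own tuning procedure may differ.

```r
# Scan eps and minPts, scoring each DBSCAN result by its mean silhouette width.
# `emb` is a numeric matrix of embedded samples (rows = samples).
scan_dbscan <- function(emb, eps_grid, min_pts_grid = 2:5) {
  grid <- expand.grid(eps = eps_grid, minPts = min_pts_grid)
  grid$n_clusters <- NA_integer_
  grid$mean_sil <- NA_real_
  for (i in seq_len(nrow(grid))) {
    cl <- dbscan::dbscan(emb, eps = grid$eps[i], minPts = grid$minPts[i])$cluster
    keep <- cl > 0                      # cluster 0 holds the noise points
    k <- length(unique(cl[keep]))
    grid$n_clusters[i] <- k
    if (k >= 2) {                       # silhouette needs at least two clusters
      sil <- cluster::silhouette(cl[keep], dist(emb[keep, , drop = FALSE]))
      grid$mean_sil[i] <- mean(sil[, "sil_width"])
    }
  }
  grid[order(-grid$mean_sil), ]         # best first, NA scores last
}

# Demo on three well-separated toy groups (hypothetical data);
# only runs when both packages are installed.
if (requireNamespace("dbscan", quietly = TRUE) &&
    requireNamespace("cluster", quietly = TRUE)) {
  set.seed(1)
  centers <- rbind(c(0, 0), c(4, 0), c(0, 4))
  emb <- do.call(rbind, lapply(1:3, function(k) {
    cbind(rnorm(15, centers[k, 1], 0.3), rnorm(15, centers[k, 2], 0.3))
  }))
  res <- scan_dbscan(emb, eps_grid = c(0.5, 1, 1.5), min_pts_grid = 2:4)
  print(head(res, 3))
}
```

Because noise points are dropped before scoring, very small `eps` values can look deceptively good, so the number of clusters and of unassigned samples is worth checking alongside the score.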
| Cluster_ID | N | Neighbors | Width_Mean | Width_Median | Width_Min | Width_1stQu | Width_3rdQu | Width_Max |
|---|---|---|---|---|---|---|---|---|
| Cluster_0 | 0 | | NaN | NA | NA | NA | NA | NA |
| Cluster_1 | 2 | 4 | 0.86 | 0.86 | 0.85 | 0.85 | 0.86 | 0.87 |
| Cluster_2 | 15 | 4;5;9;11 | 0.37 | 0.43 | -0.13 | 0.30 | 0.51 | 0.58 |
| Cluster_3 | 3 | 10 | 0.94 | 0.95 | 0.93 | 0.94 | 0.95 | 0.95 |
| Cluster_4 | 5 | 1 | 0.63 | 0.68 | 0.41 | 0.64 | 0.68 | 0.74 |
| Cluster_5 | 12 | 1;2 | 0.67 | 0.70 | 0.41 | 0.64 | 0.74 | 0.77 |
| Cluster_6 | 6 | 1 | 0.80 | 0.82 | 0.71 | 0.76 | 0.84 | 0.84 |
| Cluster_7 | 4 | 8 | 0.67 | 0.70 | 0.54 | 0.65 | 0.72 | 0.75 |
| Cluster_8 | 5 | 7 | 0.63 | 0.68 | 0.48 | 0.56 | 0.71 | 0.73 |
| Cluster_9 | 4 | 2;11 | 0.76 | 0.78 | 0.66 | 0.72 | 0.82 | 0.82 |
| Cluster_10 | 2 | 2;5 | 0.96 | 0.96 | 0.95 | 0.95 | 0.96 | 0.96 |
| Cluster_11 | 2 | 9 | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 |
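A per-cluster summary with the same columns as the table above could be assembled from silhouette widths roughly as follows. This is a sketch on toy data using `silhouette` from the cluster package; the embedding, the cluster labels, and the way neighbor clusters are collapsed into one string are assumptions, not the report's actual code.

```r
library(cluster)

# Toy 2-D embedding with three groups of 10 samples (hypothetical data)
set.seed(1)
centers <- rbind(c(0, 0), c(4, 0), c(0, 4))
emb <- do.call(rbind, lapply(1:3, function(k) {
  cbind(rnorm(10, centers[k, 1], 0.3), rnorm(10, centers[k, 2], 0.3))
}))
cl <- rep(1:3, each = 10)

# One row per sample: own cluster, nearest other cluster, silhouette width
sil <- silhouette(cl, dist(emb))
widths <- split(sil[, "sil_width"], sil[, "cluster"])
neighbors <- split(sil[, "neighbor"], sil[, "cluster"])
stat <- function(f, ...) unname(sapply(widths, f, ...))

tbl <- data.frame(
  Cluster_ID   = paste0("Cluster_", names(widths)),
  N            = unname(lengths(widths)),
  Neighbors    = unname(sapply(neighbors, function(v) paste(sort(unique(v)), collapse = ";"))),
  Width_Mean   = stat(mean),
  Width_Median = stat(median),
  Width_Min    = stat(min),
  Width_1stQu  = stat(quantile, probs = 0.25),
  Width_3rdQu  = stat(quantile, probs = 0.75),
  Width_Max    = stat(max),
  row.names = NULL
)
print(tbl, digits = 2)
```

Widths close to 1 indicate samples that sit well inside their cluster, while negative widths (as in the minimum for Cluster_2 above) indicate samples that are on average closer to a neighboring cluster.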
If the samples have been previously labeled with their known features, such as genotype, treatment and disease state, the agreement between these features and DBSCAN classification can be evaluated to discover potential feature-classification association.
Table 2. Summary of cluster-feature association.
| Feature | Num_Group | Rand | C_Rand | ChiSq | P_ChiSq | Count_Expected | Count_Observed | Obs/Exp |
|---|---|---|---|---|---|---|---|---|
| Organ | 9 | 0.86 | 0.350 | 253.8560 | 0.0001 | Table | Table | Table |
| Sex | 3 | 0.58 | 0.039 | 32.9116 | 0.0320 | Table | Table | Table |
| p53_Status | 3 | 0.54 | 0.013 | 22.2591 | 0.3200 | Table | Table | Table |
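The agreement statistics in this kind of table can be computed from a contingency table of cluster labels against feature labels. The sketch below uses base R only: `rand_indices` is a hypothetical helper that computes the Rand index and the corrected (adjusted) Rand index from pair counts, and the chi-squared p-value is obtained by Monte Carlo simulation. The toy labels and the number of simulation replicates are assumptions, and the report itself may rely on different functions.

```r
# Rand index and corrected (adjusted) Rand index between two labelings
rand_indices <- function(a, b) {
  tab <- table(a, b)
  pairs <- function(x) sum(choose(x, 2))
  n_pairs <- choose(sum(tab), 2)
  same_both <- pairs(tab)             # pairs grouped together in both labelings
  same_a <- pairs(rowSums(tab))       # pairs grouped together in `a`
  same_b <- pairs(colSums(tab))       # pairs grouped together in `b`
  rand <- (n_pairs + 2 * same_both - same_a - same_b) / n_pairs
  expected <- same_a * same_b / n_pairs
  c_rand <- (same_both - expected) / ((same_a + same_b) / 2 - expected)
  c(Rand = rand, C_Rand = c_rand)
}

# Toy labels (hypothetical): 18 samples, 3 clusters, 3 tissue groups
cluster_id <- rep(1:3, each = 6)
organ <- c(rep("A", 6), rep("B", 5), "C", rep("C", 6))

idx <- rand_indices(cluster_id, organ)

# Chi-squared test with a simulated p-value, plus observed/expected counts
set.seed(1)
ct <- chisq.test(table(cluster_id, organ), simulate.p.value = TRUE, B = 9999)
obs_over_exp <- ct$observed / ct$expected

print(round(idx, 3))
cat("ChiSq:", unname(ct$statistic), " P_ChiSq:", ct$p.value, "\n")
print(round(obs_over_exp, 2))
```

If DBSCAN leaves some samples unassigned (cluster 0), they can either be dropped or treated as their own group before building the contingency table; which is appropriate depends on the question being asked.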
Check out the RoCA home page for more information.
To reproduce this report:
Find the data analysis template you want to use and an example of its paired YAML file here, then download the YAML example to your working directory.
To generate a new report using your own input data and parameters, edit the following items in the YAML file:
Run the code below in the R console or RStudio, preferably in a new R session:
# Install (if needed) and load the packages required to build the report
if (!require(devtools)) { install.packages('devtools'); require(devtools); }
if (!require(RCurl)) { install.packages('RCurl'); require(RCurl); }
if (!require(RoCA)) { install_github('zhezhangsh/RoCAR'); require(RoCA); }
# 'filename.yaml' is the path to the YAML file you just downloaded and edited
CreateReport('filename.yaml');
If no errors are reported, go to the output folder and open the index.html file to view the report.
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] RoCA_0.0.0.9000 flexclust_1.3-4 modeltools_0.2-21
## [4] lattice_0.20-33 cluster_2.0.4 e1071_1.6-7
## [7] gplots_3.0.1 dbscan_0.9-8 Rtsne_0.11
## [10] webshot_0.3.2 plotly_4.5.2 ggplot2_2.1.0
## [13] htmlwidgets_0.7 DT_0.2 awsomics_0.0.0.9000
## [16] yaml_2.1.13 rmarkdown_1.0 knitr_1.14
## [19] RCurl_1.95-4.8 bitops_1.0-6 devtools_1.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 git2r_0.15.0 highr_0.6
## [4] RColorBrewer_1.1-2 formatR_1.4 plyr_1.8.4
## [7] class_7.3-14 base64enc_0.1-3 tools_3.2.2
## [10] digest_0.6.10 jsonlite_1.0 evaluate_0.9
## [13] memoise_1.0.0 tibble_1.1 gtable_0.2.0
## [16] viridisLite_0.1.3 DBI_0.4-1 curl_1.2
## [19] parallel_3.2.2 withr_1.0.2 stringr_1.0.0
## [22] dplyr_0.5.0 httr_1.2.1 caTools_1.17.1
## [25] gtools_3.5.0 R6_2.1.2 gdata_2.17.0
## [28] purrr_0.2.2 tidyr_0.5.1 magrittr_1.5
## [31] scales_0.4.0 htmltools_0.3.5 assertthat_0.1
## [34] colorspace_1.2-6 KernSmooth_2.23-15 stringi_1.1.1
## [37] lazyeval_0.2.0 munsell_0.4.3